scROSHI: robust supervised hierarchical identification of single cells

Abstract Identifying cell types based on expression profiles is a pillar of single cell analysis. Existing machine-learning methods identify predictive features from annotated training data, which are often not available in early-stage studies. This can lead to overfitting and inferior performance when applied to new data. To address these challenges we present scROSHI, which utilizes previously obtained cell type-specific gene lists and does not require training or the existence of annotated data. By respecting the hierarchical nature of cell type relationships and assigning cells consecutively to more specialized identities, excellent prediction performance is achieved. In a benchmark based on publicly available PBMC data sets, scROSHI outperforms competing methods when training data are limited or the diversity between experiments is large.


INTRODUCTION
After more than two decades of technological development from its earliest attempts (1, 2), single cell transcriptomics studies have come of age and are widely used for basic as well as translational research (3-5). This is best showcased by the recent explosion of single cell atlases of various organs and organisms (6-9), as well as the use of single cell transcriptomics for disease investigation (10). The term 'atlas' describes the result of identifying every known cell type in the analyzed tissue sample and discovering novel cell types defined by their transcriptomic phenotype. Performing such cell type annotation manually is often a labor-intensive process requiring expert field knowledge, in particular in the presence of closely related, unknown, or novel cell types such as developing or precursor cells.
In recent years, a large number of tools have been developed to automate cell type identification with varying performance, as summarized in a recent benchmark study (11). The common theme of these tools is that the expression profile of a target cell is compared to known expression profiles of particular cell types, possibly limited to a subset of genes that are relatively stable and highly expressed. In order to derive expression features predictive for a cell type, it is commonplace to use unsupervised clustering of a single cell data set and assign the cluster labels to cell types based on biological interpretation. In the next step, this cell type label is interpreted as the ground truth to build a machine learning model that finds the features relevant for cell type prediction.
However, as will be shown in the course of this work, learning features and performing the classification on the same data can lead to overfitting even if separate training and test data are used, provided they both were acquired under the same experimental condition. As a consequence, the cell type classification uncertainty is underestimated during validation and the true misclassification rate in the test situation is unexpectedly large. In practice, features (i.e. expression levels) learned in one study are often applied to other studies, e.g. of the same tissue type, under the assumption that the same features enable a robust classification across studies. However, expression values can vary largely between experiments, and when this assumption is violated, features should not be used across studies. As a consequence, methods dependent on training data are again prone to an increased misclassification rate when applied to new data.
Another challenge in cell type classification is that the number of possible candidate cell types is sometimes large, which tends to increase the misclassification rate. However, the cell types are mostly related because they are the product of differentiation from a smaller number of precursor cell types.
To address these challenges, we present single cell RObust Supervised Hierarchical Identification of cell types (scROSHI), which utilizes a priori defined cell type-specific gene lists and does not require training or the existence of annotated data. scROSHI is independent of any information on expression levels of the cell type-specific gene lists, and thus less prone to overfitting to any particular data set. In addition, scROSHI respects the hierarchical nature of cell type relationships due to differentiation within a lineage and it assigns cells consecutively to more specialized identities. This makes it possible to distinguish even closely related and expression-wise similar cell types. Taken together, scROSHI achieves excellent prediction performance, which we showcase by comparing the performance of scROSHI with three existing tools that scored among the best in a recent benchmark study (11). To capture a realistic scenario, we utilize three annotated datasets and apply methods across those data, i.e. cell typing is performed on a dataset different from the training set. We show that scROSHI outperforms the competing methods when the training dataset differs from the data that is evaluated for cell typing.
Taken together, scROSHI is a transparent, interpretable, and robust cell type classification approach particularly useful when previous knowledge about cell type-specific genes is available but annotated training data is scarce. scROSHI is available as an R package and can thus be seamlessly integrated into single cell analysis workflows.

MATERIALS AND METHODS
The key idea of scROSHI is conceptually simple: it requires a list of cell types expected in a sample, and for each of those cell types a list of genes expected to be cell type-specific (or at least prominently expressed in only one cell type). Based on this information, for each cell and for each cell type scROSHI compares the expression of the cell type-specific genes with the expression of the genes selected for the other expected cell types. The assumption is that a single cell is 100% pure, i.e. is identified by one cell type or another but not a mixture. Then each cell should show high expression of cell type-specific genes for only one cell type, which will be the cell type assigned by scROSHI. Provided the observed object is indeed a single cell, this assumption can be violated in two scenarios: (i) the cell type of the cell is not among the list of presented cell types (or the quality of the cell type-specific genes is poor) and (ii) the genes of more than one cell type show high expression (e.g. because two cell types are highly similar). In the first scenario scROSHI will label the respective cell as 'cell type unknown' to indicate that either a novel ('unknown') cell type is present and/or further investigation is warranted. In the second scenario scROSHI will label the cell as 'cell type uncertain', again indicating the need for further investigation, while providing information on which cell types are likely candidates for classification to ease the manual interpretation.

scROSHI: design considerations
The most important design criterion for scROSHI was that the method should be capable of automated classification in the absence of labeled training data. This excluded any machine learning approach that would require training a model. Instead, the method should utilize and rely on the vast amount of validated cell type-specific gene lists available from previous bulk or single cell experiments, such as the widely acknowledged immune cell type gene lists (also known as 'lm22') used by the cibersort algorithm (12) or the somewhat related resource for single cell melanoma data (13).
Another important design criterion was that the method should avoid re-training on a dataset currently under investigation. While, in general, inclusion of training data from a variety of sources is advantageous, at this stage, re-training would lead to overfitting and therefore to an overestimation of the prediction performance. Provided the originally chosen gene list was previously validated to be robust against changes in the experimental condition, re-training is not necessary.
This argument can be turned around to provide a strategy for arriving at a suitable gene list for cell types for which a previously validated set is not available: the genes to include as cell type-specific, i.e. genes that are highly expressed in the target cell type, should be those that have this property independent of tissue type, sample type (i.e. cultured immortalized cells, cultured primary cells, tissue biopsy), detailed setup (culture medium, organism), or patient characteristics (gender, ethnicity, age). The larger the diversity of the test data, the more robust and broadly applicable the final gene list.
Given a set of cell type-specific genes and without the need for training a model, one possibility to assign a cell type from a list of candidates to a target cell is to test for association and choose the one that fits best. On one end of the spectrum of association measures is the hypergeometric test comparing the proportion of highly expressed genes specific for one cell type with the proportion of highly expressed genes specific for all other cell types. The advantage is that it can almost always be calculated and is robust against expression outliers. On the one hand, it is simple to use because the cell type-specific reference does not have to be known quantitatively in the form of an expression profile. On the other hand, it is relatively insensitive because it completely ignores the quantitative nature of the expression profile of the target cell, which is typically available. On the other end of the spectrum, one can quantitatively match the expression profile of a known cell type to the expression profile of the target cell, for instance, using Spearman's correlation. While this approach is robust against expression outliers, it is relatively expensive in terms of data availability, i.e. it requires knowledge of the gene expression profile of the reference. We chose to follow an intermediate path by performing a quantitative and robust test, the Mann-Whitney rank sum test, to compare the expression ranks of the genes specific for one cell type with the ranks of the genes specific for all other cell types. The negative log of the test's P-value is then a measure of the association strength between the target cell and this cell type. It can be interpreted as a score for how well the target cell matches the cell type at hand.
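The rank-based score described above can be illustrated with a minimal Python sketch. It is not the scROSHI implementation (scROSHI is an R package); it assumes a normal approximation to the one-sided Mann-Whitney test with tie and continuity corrections, uses the base-10 logarithm, and omits the normalization of the score, which is not specified here. The function names, the marker genes, and the expression values are hypothetical and chosen for illustration only.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def mann_whitney_p_greater(own, others):
    """One-sided P-value for 'own tends to be larger than others'.

    Assumption: normal approximation with tie and continuity correction.
    """
    n1, n2 = len(own), len(others)
    if n1 == 0 or n2 == 0:
        return 1.0  # nothing to compare, so no evidence
    pooled = list(own) + list(others)
    ranks = average_ranks(pooled)
    u_own = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    tie_sizes = {}
    for value in pooled:
        tie_sizes[value] = tie_sizes.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return 1.0  # all values tied, so no evidence
    z = (u_own - n1 * n2 / 2 - 0.5) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2))

def cell_type_scores(expression, gene_lists):
    """Score one cell against every cell type as -log10 of the P-value."""
    scores = {}
    for cell_type, genes in gene_lists.items():
        own = [expression.get(g, 0.0) for g in genes]
        others = [expression.get(g, 0.0)
                  for other, other_genes in gene_lists.items()
                  if other != cell_type for g in other_genes]
        p = max(mann_whitney_p_greater(own, others), 1e-300)  # avoid log(0)
        scores[cell_type] = -math.log10(p)
    return scores

# Hypothetical marker lists and one hypothetical cell (illustration only).
example_gene_lists = {
    "T cell": ["CD3D", "CD3E", "CD2"],
    "B cell": ["CD79A", "MS4A1", "CD19"],
    "Monocyte": ["CD14", "LYZ", "FCGR3A"],
}
example_expression = {"CD3D": 5, "CD3E": 4, "CD2": 3, "CD79A": 0, "MS4A1": 1,
                      "CD19": 0, "CD14": 0, "LYZ": 2, "FCGR3A": 0}
example_scores = cell_type_scores(example_expression, example_gene_lists)
```

Because only ranks enter the test, any monotone transformation of the expression values leaves the scores unchanged, which is the property the text relies on.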
Like in most classification problems it is assumed here that each cell belongs to exactly one class of a given set of candidates. However, scROSHI goes one step further and allows for the introduction of two additional classes, unknown and uncertain, to deal with the unavoidable classification uncertainty.
Another step to improve the classification efficiency is to utilize the hierarchical tree structure that is inherent to cell types due to developmental specialization and therefore apply a hierarchical classification approach. Instead of classifying all cell types at once, target cells are first assigned to a smaller number of major cell types and then consecutively to more specialized classes. This way, relatively similar cell types can be distinguished provided they belong to different branches in the tree.

The scROSHI workflow
1. Find out which cell types to expect from field knowledge.
2. Obtain validated cell type-specific gene lists from the literature or learn cell type-specific genes based on other datasets. Importantly, only the gene names, not the expression values, are relevant.
3. Optional: Obtain a hierarchical tree structure to define cell type parent-kin relationships.
4. For each cell i and each cell type j in the first hierarchical level, compare the expression of the genes specific for this cell type with the expression of the genes selected for the other expected cell types: determine the P-value of a one-sided Mann-Whitney test of the null hypothesis that the expression rank sum of the genes specific for this cell type j is the same as or smaller than the rank sum of the genes specific for any other cell type in the list. The alternative hypothesis is that the rank sum of the genes specific for this cell type is larger.
5. Compute the normalized negative log of the P-value for each cell i and each cell type j, respectively. Interpret the result as a score for how well the cell matches that cell type.
6. Assign the cell type label with the highest score to the cell.
7. If none of the scores is above a certain threshold, do not assign a cell type label to the cell but assign it to the class 'unknown'.
8. If the ratio between the largest and the second largest score is below a certain threshold, do not assign a cell type label to the cell but assign it to the class 'uncertain'.
9. Repeat 4 to 7 for the second hierarchical level, and so on. Cells that have been classified as 'unknown' or 'uncertain' in the first iteration are included in the next iteration to allow classification into next level cell types.

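Steps 6 to 8 of the workflow can be sketched as a small decision function. This is an illustrative Python sketch, not the package code: the function name `assign_label` and the threshold parameters `min_score` and `min_ratio` are hypothetical, and no particular threshold values are implied.

```python
def assign_label(scores, min_score, min_ratio):
    """Return (label, candidates) for one cell from its per-cell-type scores.

    scores    : dict mapping cell type name to score (higher is better)
    min_score : hypothetical threshold; no score above it gives 'unknown'
    min_ratio : hypothetical threshold; best/second-best below it gives 'uncertain'
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_type, best_score = ranked[0]
    # Step 7: none of the scores is above the threshold.
    if best_score <= min_score:
        return "unknown", []
    # Step 8: the two best candidates are too close to call.
    if len(ranked) > 1:
        second_type, second_score = ranked[1]
        if second_score > 0 and best_score / second_score < min_ratio:
            return "uncertain", [best_type, second_type]
    # Step 6: otherwise the highest score wins.
    return best_type, [best_type]
```

Returning the two leading candidates together with the 'uncertain' label mirrors the idea that likely candidates are reported to ease manual interpretation.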
scROSHI takes as input the gene x cell count matrix, either with raw or normalized counts (Figure 1A). scROSHI is robust to the choice of normalization and/or transformation method, because the cell type score is based on ranks rather than on the actual values. In our studies we typically use sctransform (14), which corrects unwanted biases using regularized negative binomial regression. In general, it is advised that the scale of the input data matches the scale at which the cell type-specific gene lists were generated.
The second ingredient required for cell type classification is a collection of cell type-specific gene lists (Figure 1B). The selection of cell types to expect will depend on the nature of the sample. It is recommended to adapt the cell type selection to keep classification specificity high whenever possible: having closely related cell types in the candidate list may sometimes be required but should be avoided if possible.
There are a number of resources available containing curated reference datasets, mostly assembled from bulk RNAseq or microarray data of sorted cell types. Examples are the C8 set of the MSigDB collection (15), the lm22 immune cell list of cibersort (12), the BioGPS Human Cell Type and Tissue Gene Expression Profiles collection (16) from harmonizome (17), or the Bioconductor (18) package celldex (19). These references are good enough for most applications provided that they contain the cell types that are expected to be present in the data at hand. For our contribution to the Tumor Profiler Study (20), working on melanoma patient biopsy samples, we used the curated gene list from Tirosh et al. (13) in combination with the immune cell list of cibersort.
In cases where quantitative cell type-specific reference profiles are available they can be used as is or they can be binarized to obtain cell type-specific gene lists (Figure 1B). They should contain genes that show little variability and are highly expressed in the target cell type and have zero or weak expression in all other cell types. The gene lists do not need to be exclusive, i.e. the same gene can appear in different cell type lists, but the overlap between cell types should be kept small. Obviously, the more similar two cell types are, the larger the overlap between their specific gene lists will be. In addition, given the sparsity of single cell count data, gene lists with only a few members will have lower sensitivity compared to larger lists.
The third ingredient to scROSHI is a hierarchical tree structure defining parent-kin relationships between cell types. The purpose of this tree is to classify cells first into a small number of coarse-grained cell type superfamilies and then consecutively into more and more specialized, fine-grained cell (sub-)types (Figure 1E). This way, the number of possible candidate cell types in each step is much smaller than the total number of candidate cell types, thus reducing the possibility of false classification. Moreover, the thresholds for unknown and uncertain classes can be chosen to fit the detailed cell type similarity distribution in each branch to optimize classification efficiency.
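The descent through such a tree, with thresholds chosen per branch, might look as follows. This Python sketch rests on several assumptions: the tree is a plain dictionary from a parent to its children, `score_fn` is a hypothetical callable that returns scores for a list of candidate cell types (in practice the scores would be recomputed from the gene lists of those candidates), and the descent simply stops at 'unknown' or 'uncertain', which simplifies how such cells are carried into the next level. All cell type names, scores, and thresholds are illustrative.

```python
def classify_down_tree(score_fn, tree, thresholds, root="root"):
    """Return the list of labels assigned on the way down the tree.

    tree       : dict mapping a parent name to the list of its child cell types
    thresholds : dict mapping a parent name to (min_score, min_ratio) for its branch
    score_fn   : hypothetical callable, candidates -> dict of scores
    """
    path = []
    node = root
    while node in tree:
        candidates = tree[node]
        scores = score_fn(candidates)
        ranked = sorted(candidates, key=lambda c: scores[c], reverse=True)
        best = ranked[0]
        min_score, min_ratio = thresholds[node]
        if scores[best] <= min_score:
            path.append("unknown")
            break
        if (len(ranked) > 1 and scores[ranked[1]] > 0
                and scores[best] / scores[ranked[1]] < min_ratio):
            path.append("uncertain")
            break
        path.append(best)
        node = best  # descend into the branch of the assigned cell type
    return path

# Hypothetical two-level tree with branch-specific thresholds.
example_tree = {
    "root": ["Lymphoid", "Myeloid"],
    "Lymphoid": ["T cell", "B cell", "NK cell"],
    "Myeloid": ["Monocyte", "Dendritic cell"],
}
example_thresholds = {"root": (1.0, 1.5), "Lymphoid": (1.0, 1.2), "Myeloid": (1.0, 1.2)}

def make_lookup_scorer(table):
    """Build a toy score_fn that reads precomputed scores from a table."""
    return lambda candidates: {c: table[c] for c in candidates}
```

At each level only the children of the current node compete, so the number of candidates per decision stays small even when the full tree contains many cell types.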
With the three inputs (i) count matrix, (ii) cell type-specific gene lists and (iii) hierarchical relationships between cell types, scROSHI performs the cell type score assignment and classification. For each cell and each cell type, a one-sided Mann-Whitney U-test is performed. The null hypothesis is that the expression rank sum of the genes specific for this cell type is the same as or smaller than the rank sum of the genes specific for any other cell type in the list. The alternative hypothesis is that the rank sum of the genes specific for this cell type is larger. The normalized negative log of the P-value for each cell-cell type pair is interpreted as a score for how well the cell matches the respective cell type (Figure 1C). If none of the scores is above a certain threshold, no cell type label is assigned to the cell but the class 'unknown'. Also, if the ratio between the largest and the second largest score is below a certain threshold, again no cell type label is assigned to the cell but the class 'uncertain'. Both categories, 'unknown' and 'uncertain', can reflect populations that are not included in the list of a priori selected cell types, thus potentially indicating 'novel' cell types (or poor quality of the cell type-specific gene lists). These two categories therefore help to avoid misclassification by explicitly considering classification uncertainty and moreover point out cell populations that require further investigation. The choice of the two thresholds can be made ad hoc based on visual inspection of the results or consistency with other methods for unlabeled data, or based on an optimization scheme by minimizing the classification cross-entropy when ground truth-labeled data is available. In general, the higher the difference between the cell types, the more stringently the thresholds can be chosen.
Taken together, these steps facilitate an enrichment of the purely data-driven description of the single-cell data (Figure 1D) with biological meaning (Figure 1F).

Benchmarking
The datasets used, their origin, the preprocessing steps applied, the pipeline, and the competing tools are described in detail in the Supplementary Material (Supplementary Tables S1-S5, Supplementary Figures S1-S3).
To briefly summarize, we used public datasets with a similar cell type composition to benchmark scROSHI against high profile competitor methods. Data from three peripheral blood mononuclear cell experiments were retrieved, one from an adult human in which the cell types were pre-sorted (Zheng sorted set), and one each of an adult (Adult set) and a newborn (Newborn set). Hence, the three sets are similar in content but differ in experimental setting and donor age.
We defined a common set of matching cell type labels across the three datasets for comparisons between datasets (see Supplementary Material for further details on the ground truth dictionary).
Based on a previous benchmark of automatic cell identification methods (11), we decided to compare scROSHI to three front runners: support vector machine (SVM), random forest (RF) and GARNETT (21). The main difference between scROSHI and its competitors is the fact that they use part of the data to train a model whereas with scROSHI there is no training involved once the cell type-specific gene lists are selected. While SVM and RF can capture nonlinear relationships between the explanatory features (gene expression) as well as interactions between them, GARNETT is based on a penalized multivariate generalized linear prediction model (GLMNET). All methods, including scROSHI, were used under standard conditions, with default parameter settings. We trained a model for RF, SVM, and GARNETT and evaluated the performance of the classifiers by applying a 5-fold cross-validation for each dataset. The folds were split in a stratified manner in order to keep equal proportions of each cell population in each fold. We used the same training and testing folds for all classifiers. scROSHI and GARNETT require a cell type-specific gene list. We used a list of cell type-specific genes based on previous publications (12, 22) for scROSHI and a GARNETT-optimized marker list (check_markers() function from the garnett package v.0.2.17) for GARNETT. In addition, the following criteria were supplied to scROSHI for classifying a cell as unknown or uncertain. A cell is labeled unknown if none of the P-values is below 0.05 and uncertain if the ratio between the smallest and the second smallest P-values is above 0.1 (major cell type) or 0.8 (subtype). These thresholds were chosen as the default settings when designing scROSHI in the context of profiling tumor samples from the Tumor Profiler Study (20).
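The stratified 5-fold split mentioned above can be sketched in a few lines of Python. This is an illustration of the general technique, not the benchmarking pipeline itself (which is described in the Supplementary Material); the function name, the round-robin assignment, and the fixed random seed are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=0):
    """Split cell indices into folds with near-equal label proportions.

    labels : list of cell type labels, one per cell
    Returns a list of n_folds lists of cell indices.
    """
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)  # fixed seed so every classifier sees the same folds
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the cells of this population round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds
```

Each fold then serves once as the test set while the remaining four folds form the training set, and reusing the same seed reproduces identical folds for every classifier.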
Validation scheme. Each of the three datasets was split into training, validation, and testing sets. Three major validation runs were performed in which each of the three datasets served as the training/validation set. After the final model was obtained, it was tested once 'in set' on the testing set that came from the same experiment as the training data, and two times 'out of set' on the two remaining sets from which the model had not seen any data. scROSHI was tested by the same scheme. Further details on the validation scheme can be found in the Supplementary Material.

Copy number variation estimation
To pre-process scRNA-seq data from the Tumor Profiler Study, we used a procedure based on standard quality control measures (23). First, to retain only high quality cells, we removed cells with fewer than 700 expressed genes and 1500 total read counts detected. Second, to avoid contamination by dying cells while retaining as many informative cells as possible, we filtered out cells with more than 35% of read counts coming from mitochondrial genes (24, 25).
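These filters can be written as a per-cell check. The Python sketch below makes two assumptions that the text leaves open: a cell is kept only if it meets both the gene and the read count minimum, and mitochondrial genes are recognized by the human gene symbol prefix "MT-". The function name and the dictionary input format are hypothetical.

```python
def passes_qc(counts, min_genes=700, min_counts=1500, max_mito_fraction=0.35):
    """Decide whether one cell passes the quality filters.

    counts : dict mapping gene symbol to read count for a single cell
    Assumptions: both minima must be met, and mitochondrial genes
    carry the 'MT-' prefix (human gene symbols).
    """
    total = sum(counts.values())
    if total == 0:
        return False
    expressed_genes = sum(1 for count in counts.values() if count > 0)
    mito_reads = sum(count for gene, count in counts.items() if gene.startswith("MT-"))
    return (expressed_genes >= min_genes
            and total >= min_counts
            and mito_reads / total <= max_mito_fraction)
```

Applied to a count matrix, the check would be run once per column (cell), and only cells returning True would enter the downstream analysis.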
To distinguish normal from malignant cells, we inferred large-scale copy number variations (CNVs) from the gene expression data using infercnvpy (https://github.com/icbi-lab/infercnvpy). We ran infercnvpy on every sample individually using T cells, B cells, Endothelial cells and Macrophages as reference cells. The gene ordering file containing the chromosomal start and end position for each gene was generated from the human GRCh37 assembly. To reduce the noise level, we only used genes that had a mean read count greater than 0.1.
We then used an approach based on hierarchical clustering of single cell copy number profiles to detect cells with and without CNVs. After calling CNVs, we used scipy's implementation of hierarchical clustering with Ward linkage (26) to obtain a dendrogram of the CNV profiles. By definition, each node in a dendrogram had exactly two child nodes that represented a cluster of clusters, except for leaf nodes that represented a cluster of cells. Each cell was annotated as malignant or non-malignant using scROSHI's cell type annotations. Starting at the root node, we then iteratively assigned a CNV status to the nodes according to the composition of their subtrees. Specifically, a node and all nodes in its subtree were annotated as presenting no CNVs if both its subtrees contained at least 60% of non-malignant cells. We traversed the dendrogram until we reached all nodes or a maximum depth of five in the dendrogram. Finally, a cell was assigned the 'no CNVs' status if it belonged to a leaf node that had been annotated as not presenting CNVs. All remaining cells were annotated as showing CNVs.
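The node annotation rule can be illustrated on a small hand-built tree. The Python sketch below does not perform the clustering itself; it assumes the dendrogram is already available as nested 2-tuples whose leaves are lists of (cell_id, is_malignant) pairs, counts the root as depth 0, and evaluates nodes down to and including depth five. These representation choices and all names are hypothetical.

```python
def cells_in(node):
    """Collect (cell_id, is_malignant) pairs from all leaves below a node."""
    if isinstance(node, tuple):
        left, right = node
        return cells_in(left) + cells_in(right)
    return list(node)

def non_malignant_fraction(node):
    """Fraction of cells below a node that are annotated as non-malignant."""
    cells = cells_in(node)
    return sum(1 for _, is_malignant in cells if not is_malignant) / len(cells)

def cells_without_cnv(node, depth=0, max_depth=5, min_fraction=0.6):
    """Return the ids of cells that receive the 'no CNVs' status.

    A node (with its whole subtree) is annotated as presenting no CNVs if
    both child subtrees contain at least min_fraction non-malignant cells.
    """
    if not isinstance(node, tuple) or depth > max_depth:
        return set()  # leaf clusters have no subtrees to test
    left, right = node
    if (non_malignant_fraction(left) >= min_fraction
            and non_malignant_fraction(right) >= min_fraction):
        return {cell_id for cell_id, _ in cells_in(node)}
    return (cells_without_cnv(left, depth + 1, max_depth, min_fraction)
            | cells_without_cnv(right, depth + 1, max_depth, min_fraction))

# Hypothetical dendrogram: internal nodes are 2-tuples, leaves are lists of cells.
toy_dendrogram = (
    ([("a", False), ("b", False), ("c", False)],
     [("d", False), ("e", False), ("f", True)]),
    ([("g", True), ("h", True)],
     [("i", True), ("j", False)]),
)
no_cnv_ids = cells_without_cnv(toy_dendrogram)
```

In this toy tree the root fails the test because its right subtree is mostly malignant, while the left child passes, so cells a to f (including the single malignant-labeled cell f) receive the 'no CNVs' status and cells g to j are annotated as showing CNVs.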

Performance evaluation
We compared the performance of scROSHI on test datasets with the performance of supervised methods that had been trained on the same dataset as the test data (intra-dataset evaluation) or on a different dataset (inter-dataset evaluation). There were three types of classifiers: (1) a prior knowledge method (scROSHI), for which a cell type-specific gene list is required; (2) supervised methods (RF, SVM), which require a training dataset labeled with corresponding cell labels; and (3) a combined method (GARNETT), which requires both a cell type-specific gene list and a training dataset. We calculated the percentage of unlabeled cells across all cell populations per classifier. Further, we calculated the accuracy of only major cell types for scROSHI and GARNETT, since both methods perform a hierarchical cell typing with major and subtype labels (Supplementary Table S5). Additionally, we determined the proportions of cells that only have a major cell type label, cells that have the label 'unknown', or are unclassified.
Figure 2 shows the overall results of the inter- and intra-dataset evaluation. Generally, scROSHI performs as well as the supervised methods if the supervised methods were trained with the test dataset (scROSHI accuracy: adult 0.823, newborn 0.879, Zheng 0.715). However, scROSHI outperforms the supervised methods if they were trained with another dataset. In this case we observed a lower accuracy and/or a higher amount of unlabeled cells for all supervised methods, a consequence of overfitting to the training data. The supervised methods perform better if they were trained with a dataset that is closer to the test dataset (e.g. training data: Adult; test data: Newborn) but there is a strong decrease in performance if the test data is dissimilar (e.g. training data: Adult; test data: Zheng).
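The two quantities reported per classifier, accuracy and the fraction of unlabeled cells, can be computed as in the following Python sketch. It assumes that accuracy is evaluated on labeled cells only and that 'unknown' and 'uncertain' both count as unlabeled; the exact definitions used in the benchmark are given in the Supplementary Material, and the function name is hypothetical.

```python
def evaluate_predictions(true_labels, predicted_labels,
                         unlabeled=("unknown", "uncertain")):
    """Return (accuracy_on_labeled_cells, unlabeled_fraction).

    Assumption: cells without a cell type label are excluded from the
    accuracy and reported separately as the unlabeled fraction.
    """
    n_cells = len(true_labels)
    labeled = [(truth, prediction)
               for truth, prediction in zip(true_labels, predicted_labels)
               if prediction not in unlabeled]
    unlabeled_fraction = (n_cells - len(labeled)) / n_cells
    if not labeled:
        return float("nan"), unlabeled_fraction
    accuracy = sum(truth == prediction for truth, prediction in labeled) / len(labeled)
    return accuracy, unlabeled_fraction
```

Reporting both numbers side by side matters because a classifier can reach a high apparent accuracy simply by leaving most cells unlabeled.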
The subtype classification on the Zheng dataset was challenging for all classifiers (scROSHI accuracy: 0.715). However, the accuracy of the major cell type label was 0.952 for scROSHI, indicating that even if it was not possible to find the correct subtype label the correct major cell type label could usually be determined. Moreover, even though the fraction of unknown cells was slightly increased for scROSHI in the Zheng dataset, considerably elevated levels were observed for the ML-based methods regardless of whether the Zheng data was involved in training or testing (Figure 2, black bars in the right column and the bottom row, respectively). There were two cases where at first glance the out-of-training performance of the RF model was comparable (training Zheng, test Adult) or even better (training Newborn, test Zheng) than scROSHI. However, both were accompanied by an unknown cell fraction of more than 70% in RF but below 10% in scROSHI. Essentially, the high apparent accuracy was therefore only achieved at the cost of a large proportion of cells that could not be assigned any label.
All in all, our benchmark study shows that scROSHI performs better than competing tools provided a good quality cell type-specific gene list is available and annotated training data are limited or not available, which is often a realistic scenario in early stage projects. Moreover, the good performance is achieved with a very reasonable amount of resources. For example, it took less than 35 s to classify 2000 cells expressing 3368 genes into 11 cell types using 6 GB RAM on a standard laptop (i7 Intel processor). Because classification is done independently cell-by-cell, even extremely large datasets can be handled by splitting into smaller batches.
Similar to the scoring tool ucell (27), the cell type score of scROSHI depends only on the relative rank of the gene expression signal, does not require normalization, and makes no assumptions about the distribution of the signal. But, because scROSHI utilizes the hierarchical nature of cell identities, it can outperform its competitor when a sample contains similar cell types that derive from different branches of the lineage tree. scROSHI was developed for 10xGenomics mRNAseq data of tumor patient samples but there is no known limitation to using it on any other modality or organism. However, it is ideal if the cell type-specific gene lists were defined from results of the same technology as the data at hand.
One possibility to improve the performance of the machine learning tools, i.e. the accuracy on unseen data, might be to train them on a more diverse data set. Yet, because training on accuracy does not learn causal features for cell type identity, this approach by design does not lead to a universally applicable model and the performance will still be lower on unseen data than in the validation set, due to overfitting.
The hierarchical scheme in scROSHI, to successively classify cells first into more coarsely grained parent cell types, followed by more and more fine-grained sibling cell subtypes within each parent cell type, reduces the classification complexity in each branch of the tree, potentially reducing the classification error rate in turn. Moreover, the thresholds for unknown and uncertain classes can be tailored to the detailed cell type similarity distribution or count matrix sparsity within each branch.

Consistency with estimations of copy number alterations
In addition to these benchmark datasets with known ground truth but relatively simple cell type composition, we used scROSHI for cell type identification in clinical samples, i.e. biopsies from melanoma patients participating in the Tumor Profiler Study (20). In these samples the cell type composition can vary considerably from patient to patient depending, for instance, on the biopsy location, and is more complex to start with. No ground truth was available for such clinical samples, thus we evaluated the classification results by comparison to single cell CyTOF cell type composition analysis on the same samples (28) and by consistency with copy number variation (CNV) estimations (Figure 2). The rationale is that only tumor cells are expected to harbor any CNVs, and thus CNVs can be used to distinguish tumor cells from non-tumor cells.
The three representative samples in Figure 3A-C show a diverse cell type composition, as illustrated by the two-dimensional UMAP representation based on gene expression in the top row. CNV states appear nearly exclusively in cells identified as melanoma cells, the only malignant cells present (insets). In Figure 3 bottom row, the focus is shifted to UMAP representations based on CNV states, where all non-malignant cells form a single cluster and malignant cells one or more separate clusters. In the sample shown in Figure 3A, a few cells located in the melanoma cluster are mis-classified as cancer associated fibroblasts (CAFs, filled purple circles), possibly a consequence of an increased copy number in melanoma cells located at some CAF-specific genes and/or copy number decrease in some melanoma-specific genes. The cell type composition in these samples is dom-

Unexpected cell types
As we have introduced the label 'unknown' into scROSHI when none of the classification scores of the list of candidate cell types was high, we investigated whether this would empower scROSHI to recognize that there is an unexpected cell type present in a sample.
We simulated the situation that there is an unknown cell type in a sample by removing one cell type from the candidate cell type list. As a starting point, we used a sample from the Tumor Profiler Study (20) with several different cell types that could be identified (Figure 4A). Then we removed the genes specific for three cell types (Plasmacytoid dendritic cells (pDC), T cells, Melanoma cells) and repeated the analysis for each case (Figure 4B-D). All previously identified pDCs were classified as 'unknown' when excluded from the candidate list, as expected (Figure 4B, bottom right corner).
When T cells were missing in the candidate list, a small proportion was mis-classified as dendritic cells or plasma cells but the vast majority was correctly labeled 'unknown' (Figure 4C). Moreover, the cells now (mis-)labeled as dendritic cells are sparsely scattered across the entire former T cell population rather than forming a compact cluster or region, which would be expected if they belonged to a well-defined cell type or subtype. This observation should raise suspicion and trigger further investigation as it reflects the possibility that the cell type-specific genes do not represent the profile of a distinct cell population observed in this particular study.
In contrast, when melanoma cells were removed from the candidate list, a considerable proportion was mis-labeled (Figure 4D). One particular subpopulation on the left hand side of the melanoma cluster appears to share expression features with cancer associated fibroblasts (CAFs), whereas another subpopulation on the right hand side of the melanoma cluster seems to share some similarity with macrophages. At the same time the relatively large proportion of 'unknown' cells in the center of the cluster indicates that the cell type candidate list is incomplete or otherwise not suitable for this kind of data. A possible explanation for this observation may be the fact that tumor cells can share expression features with other cell types by exploiting cellular plasticity and de-differentiation programs (29).
To summarize this part, most of the cells for which the cell type specifics were excluded from the candidate list were labeled as 'unknown' while a small proportion was misclassified. This procedure outlines how scROSHI may serve as a tool to detect novel cell types that were not expected to be in the sample under investigation.

CONCLUSION
Cell type identification is a critical, yet challenging, step in single cell transcriptomics analysis. Although various machine learning based methods for cell typing are available, the necessity to learn features on adequate training data is prone to overfitting and also challenging in practice, in particular for studies on novel experimental conditions. Here, we have presented scROSHI, a novel supervised cell type classification method that is independent of training data and instead based on a priori defined cell type-specific genes. In a benchmark study and on clinical data from tumor samples, we have shown that scROSHI is useful, robust, versatile, and competitive with existing methods under real-life scenarios.

DATA AVAILABILITY
Availability of benchmark data: ask at scp-support@broadinstitute.zendesk.com.
The three data sets from the TumorProfiler Study are available upon request at info@tu-pro.ch, and according to the data sharing policy at the website https://ethnexus.github.io/tu-pro w e bsite/data/ (in preparation by the consortium). In the meantime, we have posted the raw count matrices on Zenodo (https://doi.org/10.5281/zenodo.6577402).

SUPPLEMENTARY DATA
Supplementary Data are available at NARGAB Online.